Dataset is located at: https://www.kaggle.com/GoogleNewsLab/health-searches-us-county
Vaccinations have saved millions of lives globally, eradicating diseases such as smallpox in the 1970s. Interest in vaccinations on Google increased from 2008 through 2017, and while vaccination coverage across the US has broadened, many children remain unvaccinated (https://stacks.cdc.gov/view/cdc/59413).
There may be many reasons for searching for vaccines on Google: preparing for travel abroad, seeking information about specific vaccinations for children or pets, research, and so on. The list is endless, but there is still value in understanding how interest in vaccines is changing in the U.S. over time, and how that interest varies by state.
While Google can provide great resources and general information, it is not always credible and often confuses people with conflicting information across websites. Education about vaccines should be provided by medical professionals.
This analysis has the potential to be turned into a tool built on Google's search API that provides real-time information to medical providers or organizations offering professional advice and education about vaccination. Providers could determine which US states have the highest interest in vaccination, and therefore where resources should be allocated for maximum efficacy. Where interest is notably high or low, providers and organizations could investigate why, using surveys of patients and community members, and based on the results decide which locations need vaccine education the most.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import requests
import zipfile
import io
from collections import Counter
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
from plotly.offline import iplot
import plotly.express as px
import plotly.graph_objects as go
import warnings
from numpy.linalg import LinAlgError
import time
# install watermark extension
!pip install watermark
# once it is installed, you'll just need this in future notebooks:
%load_ext watermark
%watermark -a "Chantel Clark" -d -t -v -p numpy,pandas,matplotlib,seaborn
This cell will open a window where you can select the data CSV file you downloaded.
"""
from google.colab import files
uploaded = files.upload()
for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
"""
#file='RegionalInterestByConditionOverTime.csv'
file = r'C:\Users\chant\OneDrive\Documents\Springboard\Data\RegionalInterestByConditionOverTime.csv'
healthSearchData=pd.read_csv(file)
healthSearchData.head()
# Check for null values
for col in healthSearchData.columns:
    bool_series = pd.isnull(healthSearchData[col])
    if len(healthSearchData[bool_series].index.values) != 0:
        print(healthSearchData[bool_series].index.values)
# Create a 'state_abbrev' column by extracting state, then map state to U.S. region
temp = []  # initialize empty list
for i in healthSearchData['dma']:
    temp.append(i[-2:])  # add last two letters of dma to temp list
healthSearchData['state_abbrev'] = temp  # create new column 'state_abbrev'
# Fix state_abbrev at index 11; its dma is formatted differently from the other dma's
healthSearchData.at[11, 'state_abbrev'] = 'DC'
healthSearchData.iloc[11]
# Dictionary of state abbreviations to region ('West', 'Midwest', 'Northeast', 'South', and 'O' = other)
states_region = {
'AK': 'West',
'AL': 'South',
'AR': 'South',
'AS': 'O',
'AZ': 'West',
'CA': 'West',
'CO': 'West',
'CT': 'Northeast',
'DC': 'South',
'DE': 'South',
'FL': 'South',
'GA': 'South',
'GU': 'O',
'HI': 'West',
'IA': 'Midwest',
'ID': 'West',
'IL': 'Midwest',
'IN': 'Midwest',
'KS': 'Midwest',
'KY': 'South',
'LA': 'South',
'MA': 'Northeast',
'MD': 'South',
'ME': 'Northeast',
'MI': 'Midwest',
'MN': 'Midwest',
'MO': 'Midwest',
'MP': 'O',
'MS': 'South',
'MT': 'West',
'NA': 'O',
'NC': 'South',
'ND': 'Midwest',
'NE': 'Midwest',
'NH': 'Northeast',
'NJ': 'Northeast',
'NM': 'West',
'NV': 'West',
'NY': 'Northeast',
'OH': 'Midwest',
'OK': 'South',
'OR': 'West',
'PA': 'Northeast',
'PR': 'O',
'RI': 'Northeast',
'SC': 'South',
'SD': 'Midwest',
'TN': 'South',
'TX': 'South',
'UT': 'West',
'VA': 'South',
'VI': 'O',
'VT': 'Northeast',
'WA': 'West',
'WI': 'Midwest',
'WV': 'South',
'WY': 'West'
}
# Create a column 'region' that maps state abbreviation to region
healthSearchData['region'] = healthSearchData['state_abbrev'].map(states_region)
healthSearchData.head()
# Drop the 'geoCode' column
healthSearchData = healthSearchData.drop('geoCode', axis=1)
healthSearchData.info()
healthSearchData.describe().T.head()
The data here comes from 210 metropolitan areas across the U.S. Scores range from 0-100 and are described in the Google Search API as an 'interest score'. They are proportional to the fraction of all Google searches, so a score of 100 represents very high interest and a score of 50 means the term is half as popular. A score of 0 indicates that there was not enough data.
Caution must be used in interpreting this data; it is not possible to obtain the total count of searches because the scores are proportional. This means that a larger city with half of the queries related to vaccines would receive a score of 50, and a smaller city where $\frac{3}{4}$ of queries were related to vaccines would receive a score of 75.
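The proportional scoring described above can be sketched numerically. This is an illustration only, with hypothetical query counts, not actual Google data:

```python
def interest_score(vaccine_queries, total_queries):
    # The score is proportional to the fraction of all searches,
    # not to the raw query count (hypothetical simplification).
    return round(100 * vaccine_queries / total_queries)

# A large city with half its queries about vaccines scores lower than a
# small city where three quarters of queries are about vaccines:
print(interest_score(500_000, 1_000_000))  # -> 50
print(interest_score(750, 1_000))          # -> 75
```

Note that the small city "wins" despite having far fewer absolute queries, which is exactly why raw counts cannot be recovered from these scores.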
Because the data includes proportional health interest scores and not total counts, the best way to use this data is to look at relative increases and decreases in the interest scores rather than absolute magnitudes.
This way, even if there were a larger volume of searches in the most recent years because of increased internet use, the mean score should not be greatly affected by the increased volume of users. Realistically though, the increased access to internet for different types of users (of various ages and wealth classes) will influence the outcomes, and should be considered when interpreting the data.
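One way to work with relative changes is pandas' built-in percentage change. A minimal sketch on made-up yearly mean scores (not values from the dataset):

```python
import pandas as pd

# Toy mean interest scores per year (illustrative only)
scores = pd.DataFrame({'year': [2015, 2016, 2017],
                       'mean_score': [40.0, 44.0, 55.0]})
# Year-over-year relative change sidesteps the unknown total search volume
scores['pct_change'] = scores['mean_score'].pct_change() * 100
print(scores)
```

Here the 2016 and 2017 rows show +10% and +25% changes, which are comparable across years even if the underlying search volume grew.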
The boxplot is useful for identifying outliers, but we can get more information from the swarmplots. Looking at all of the data as a swarmplot can help us to understand which region of the U.S. the outliers are coming from.
# Create 'vaccine' search dataframe with columns for year, count, region.
years = []
vacc_all = pd.DataFrame()
for col in healthSearchData.columns:
    if '+' in col and 'vaccine' in col:  # columns are named like '<year>+vaccine'
        year = col.split('+')[0]
        years.append(year)
        add_col = healthSearchData[col]
        vacc_all = pd.concat([vacc_all, add_col], axis=1)
vacc_all.describe().T
vacc_all.head()
# Rename columns with years only
vacc_all.columns = years
# Add 'region' column
vacc_all['region'] = healthSearchData['region']
# Melt dataframe to have only columns 'region', 'year', and 'value'
melted_df = pd.melt(vacc_all, id_vars='region', var_name='year' )
# Create a box plot for vaccine query
sns.set(font_scale=1.5)
sns.set_style('whitegrid')
fig = plt.gcf()
fig.set_size_inches(15, 8)
sns.boxplot(x="year", y="value", data=melted_df)
plt.title('Vaccine Search Interest Scores')
plt.xlabel('Year')
plt.ylabel('Interest Score')
# Swarm plot
fig = plt.gcf()
fig.set_size_inches(15, 8)
ax = fig.add_subplot(111)
s = sns.swarmplot(x = 'year', y='value', hue='region', data=melted_df, dodge=True)
plt.title('Vaccine Search Interest Scores')
plt.xlabel('Year')
plt.ylabel('Interest Score')
# Place legend outside of plot
handles, labels = ax.get_legend_handles_labels()
l = plt.legend(handles[0:4], labels[0:4], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
# Count the number of zeros and total number of data points
num_zeros = len(melted_df[melted_df.value == 0])
total_n = len(melted_df)
# Add counts text to plot
s.text(10+0.2, 6.5, "zeros: " + str(num_zeros), horizontalalignment='left', size='medium', color='black', weight='semibold')
s.text(10+0.2, 2.5, "total n: " + str(total_n), horizontalalignment='left', size='medium', color='black', weight='semibold')
Scores of 100 come from all regions, with the fewest from the Northeast; there was only one 100 score from the Northeast, in 2004.
There were 32 zero scores in total, which is 1.1% of the vaccine subset. Since only a small percentage of scores are zero, it is reasonable to omit them.
# vacc_all dataframe, each row is a city
# Add 'state' column
vacc_all['state'] = healthSearchData['state_abbrev']
vacc_all.head()
# Count the number of entries/cities per state
#from collections import Counter
Counter(vacc_all.state)
# Melt dataframe to have only columns 'region', 'state', 'year', and 'score'
melted_df = pd.melt(vacc_all, id_vars=['region','state'], var_name='year', value_name='score')
melted_df.head()
# Replace zeros with NaN so they are excluded from the state mean scores
melted_df = melted_df.replace(0, np.nan)
# Find all of the locations with scores of 100
melted_df[melted_df.score == 100]
# Find the mean score for each state, in each year
grouped_df = melted_df.groupby(['year', 'state']).mean()
grouped_df.head()
# Reset index to make year and state columns
grouped_df.reset_index(inplace=True)
grouped_df.head()
""""
def configure_plotly_browser_state():
    import IPython
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))
configure_plotly_browser_state()
"""
# Create slider
data_slider = []
for year in grouped_df.year.unique():
    # Select the year (copy to avoid a SettingWithCopyWarning)
    one_year = grouped_df[grouped_df.year == year].copy()
    for col in grouped_df.columns:  # Transform the columns into string type
        one_year[col] = one_year[col].astype(str)
    # Create the text for mouse-hover for each state, for the current year
    one_year['text'] = one_year['state'] + '<br>Score: ' + one_year['score']
    # Create the dictionary with the data for the current year
    data_one_year = dict(
        type='choropleth',
        locations=one_year['state'],
        z=one_year['score'].astype(float),
        locationmode='USA-states',
        zmin=20,
        zmax=100,
        text=one_year['text'],
        hoverinfo='text',
    )
    data_slider.append(data_one_year)  # Add the dictionary to the list of dictionaries for the slider
# Create the steps for the slider
steps = []
for i in range(len(data_slider)):
    step = dict(method='restyle',
                args=['visible', [False] * len(data_slider)],
                label='Year {}'.format(i + 2004))  # label to be displayed for each step (year)
    step['args'][1][i] = True
    steps.append(step)
# Create the 'sliders' object from the 'steps'
sliders = [dict(active=0, pad={"t": 1}, steps=steps)]
#from plotly.offline import iplot
#configure_plotly_browser_state()
#set up the layout (including slider option)
layout = dict(geo=dict(scope='usa',
projection={'type': 'albers usa'},
showlakes=False),
sliders=sliders)
# create the figure object:
fig = dict(data=data_slider, layout=layout)
# plot
iplot(fig)
Interest across the states dropped during 2008, followed by a large increase from 2009-2011. While the Northeast had the fewest 100 scores, it appears to have consistently higher interest in vaccines than the South.
state_df = grouped_df.pivot(index='year', columns='state', values='score')
state_df.head()
# Is there a correlation between states?
# Create a correlation / heat map
sns.set(font_scale=1.25)
# Create a correlation matrix of means
corr = state_df.corr(method='pearson')
# Seaborn heatmap
plt.figure(figsize=(17,17))
ax = sns.heatmap(
corr,
vmin=0.6, vmax=1, center=0.8,
cmap='coolwarm',
square=True,
robust=True,
annot_kws={"size":10},
cbar_kws={"shrink": 0.5}
)
ax.set_xticklabels(
ax.get_xticklabels(),
rotation=90
);
plt.title('Pearson Correlation Matrix of Mean Scores', fontdict = {'fontsize' : 20})
# What is going on with Mississippi, are the scores from MS really different from the other states?
MS = grouped_df[grouped_df.state == 'MS']
AL = grouped_df[grouped_df.state == 'AL']
CO = grouped_df[grouped_df.state == 'CO']
# Plot MS mean scores over time
sns.set(font_scale=1.25)
sns.set_style('whitegrid')
plt.figure(figsize=(12,5))
plt.plot(MS.year, MS.score)
plt.plot(AL.year, AL.score)
plt.plot(CO.year, CO.score)
plt.ylim(0,100)
plt.title('Mississippi, Alabama, and Colorado Vaccine Interest')
plt.xlabel('Year')
plt.ylabel('Average Score')
plt.legend(loc='best', labels=['MS', 'AL','CO'])
vacc_all.head()
# Is there a correlation between years?
# Create a correlation / heat map
sns.set(font_scale=1.25)
# Create a correlation matrix of means
corr = vacc_all.corr(method='pearson')
# Seaborn heatmap
plt.figure(figsize=(10,10))
ax = sns.heatmap(
corr,
vmin=0, vmax=1, center=0.5,
cmap='coolwarm',
square=True,
robust=True,
annot=True,
annot_kws={"size":10},
cbar_kws={"shrink": 0.5}
)
ax.set_xticklabels(
ax.get_xticklabels(),
rotation=60,
horizontalalignment='right'
);
plt.title('Pearson Correlation Matrix of Mean Scores', fontdict = {'fontsize' : 20})
The data has low correlation from 2004-2010, with increasing correlation for years 2011-2017.
grouped_df.head()
# Subset California data
ca_df = grouped_df[grouped_df.state == 'CA']
ca_df = ca_df.drop('state', axis=1)
ca_df.head()
sns.set(font_scale=1.25)
sns.set_style("whitegrid")
plt.figure(figsize=(12,5))
ax = plt.subplot(111)
plt.plot(ca_df.year, ca_df.score)
plt.ylim(0,100)
plt.title("California Vaccine Search Interest", fontsize=20)
plt.xlabel('Year')
plt.ylabel('Mean Score')
# Use datetime formatting
ca_df['year'] = pd.to_datetime(ca_df['year'], format="%Y")
# Rename column 'year' to 'date'
ca_df.columns = ['date', 'score']
ca_df.head()
ca_df.info()
# Set year as index
ca_df.set_index('date', inplace=True)
ca_df.head()
# Upsample to monthly frequency and interpolate linearly between yearly values
interp_df = ca_df.resample('M').mean().interpolate(method='linear')
interp_df.head(13)
# Calculate the rolling statistics with window using 3 observations
roll_mean = interp_df.rolling(window=3).mean()
roll_sd = interp_df.rolling(window=3).std()
#roll_mean.head()
interp_df.info()
# Plot the data with rolling stats
plt.figure(figsize=(15,6))
plt.plot(interp_df, label='Original Data')
plt.plot(roll_mean, color='red', label='Rolling Mean')
plt.plot(roll_sd, color='black', label='Rolling Standard Deviation')
plt.title("Rolling Mean and Standard Deviation")
plt.legend(loc='best', fontsize='small')
# Dickey-Fuller test
# from statsmodels.tsa.stattools import adfuller
dftest = adfuller(interp_df['score'], autolag='AIC')
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', 'n_lags', 'n_observations'])
for key, value in dftest[4].items():
    dfoutput['Critical Value (%s)' % key] = value
print(dfoutput)
def test_stationarity(timeseries):
    # Calculate the rolling statistics with a window of 3 observations
    roll_mean = timeseries.rolling(window=3).mean()
    roll_sd = timeseries.rolling(window=3).std()
    # Plot
    plt.figure(figsize=(15,6))
    plt.plot(timeseries, label='Data')
    plt.plot(roll_mean, color='red', label='Rolling Mean')
    plt.plot(roll_sd, color='black', label='Rolling Standard Deviation')
    plt.title("Rolling Mean and Standard Deviation")
    plt.legend(loc='best', fontsize='small')
    # Dickey-Fuller test
    dftest = adfuller(timeseries['score'], autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', 'n_lags', 'n_observations'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)
# Create 1st and 2nd order differences
diff = interp_df.diff()
diff.dropna(inplace=True)
diff2 = interp_df.diff().diff()
diff2.dropna(inplace=True)
log_df = np.log(interp_df)
log_diff = log_df - log_df.shift()
log_diff.dropna(inplace=True)
test_stationarity(diff)
test_stationarity(diff2)
test_stationarity(log_diff)
# Plot Data, Residuals and ACF functions
#from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Original Series
fig, ax = plt.subplots(3, 2, figsize=(15,8))
plot = ax[0,0].plot(ca_df)
ax[0,0].set_title('Original Series')
acf=plot_acf(interp_df, ax=ax[0,1])
plot2 = ax[1,0].plot(diff)
ax[1,0].set_title('1st Difference')
acf2=plot_acf(diff, ax=ax[1,1])
plot3 = ax[2,0].plot(diff2)
ax[2,0].set_title('2nd Difference')
acf2=plot_acf(diff2, ax=ax[2,1])
plt.tight_layout()
# Original Series
fig, ax = plt.subplots(3, 2, figsize=(15,8))
plot = ax[0,0].plot(interp_df)
ax[0,0].set_title('Original Series')
pacf=plot_pacf(interp_df, lags=15, ax=ax[0,1])
#pacf = plot_pacf(ca_df, lags=15)
plot2 = ax[1,0].plot(diff)
ax[1,0].set_title('1st Difference')
pacf2=plot_pacf(diff, lags=15, ax=ax[1,1])
plot3 = ax[2,0].plot(diff2)
ax[2,0].set_title('2nd Difference')
pacf2=plot_pacf(diff2, lags=15, ax=ax[2,1])
plt.tight_layout()
Try p = 1 for the autoregressive parameter, since the first lag lies outside the significance bounds.
# ARIMA (Autoregressive Integrated Moving Average) model is used for timeseries forecasting
#from statsmodels.tsa.arima_model import ARIMA
# Fit model
model = ARIMA(interp_df, order=(1,1,5)) # (p,d,q) -> p = lag value for autoregression, d = difference order, q = size of moving average window
results_AR = model.fit(disp=-1, )
print(results_AR.summary())
The AIC score is 94.671, and we will compare this value against other orders of the ARIMA model. A lower AIC indicates a better-fitting model.
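For reference, AIC is defined as 2k - 2·ln(L), where k is the number of model parameters and L the maximized likelihood: it rewards fit but penalizes complexity. A minimal sketch with made-up log-likelihood values (not the values from the models above):

```python
def aic(log_likelihood, n_params):
    # AIC = 2k - 2*ln(L): more parameters raise the score,
    # a higher likelihood lowers it; lower AIC is better.
    return 2 * n_params - 2 * log_likelihood

# Hypothetical comparison: a richer model only "wins" if its likelihood
# gain outweighs the extra parameters.
print(aic(log_likelihood=-40.0, n_params=7))   # -> 94.0
print(aic(log_likelihood=-39.5, n_params=11))  # -> 101.0
```

Here the second model fits slightly better but pays for four extra parameters, so the simpler model has the lower AIC.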
# Loop through values for parameters p,d,q to determine which model produces the lowest AIC
# Print warnings only once
#import warnings
warnings.filterwarnings(action='once')
ps = [0,1,2,3,4,5]
ds = [0,1,2]
qs = [0,1,2,3,4,5]
best_aic = np.inf  # start at infinity so the first successful fit is always kept
best_p = 0
best_d = 0
best_q = 0
for p in ps:
    for d in ds:
        for q in qs:
            try:
                # (p,d,q) -> p = lag value for autoregression, d = difference order, q = size of moving average window
                model = ARIMA(interp_df, order=(p,d,q))
                results_AR = model.fit(disp=-1)
                aic = results_AR.aic
                if aic < best_aic:
                    best_p = p
                    best_d = d
                    best_q = q
                    best_aic = aic
                print(p, d, q, aic)
            except (ValueError, LinAlgError):
                pass  # skip orders that fail to converge
print('The best p,d,q = {}, {}, {}, with AIC = {}'.format(best_p, best_d, best_q, best_aic))
The best model has order (5,1,4) and an AIC score of 81.366, which is significantly lower than the first model with order (1,1,5). ARIMA model (5,1,4) will be used to fit the California subset of scores.
# Fit model
model = ARIMA(interp_df, order=(5,1,3)) # (p,d,q) -> p = lag value for autoregression, d = difference order, q = size of moving average window
results_AR = model.fit(disp=-1)
print(results_AR.summary())
# Fit model
model = ARIMA(interp_df, order=(best_p, best_d, best_q)) # (p,d,q) -> p = lag value for autoregression, d = difference order, q = size of moving average window
results_AR = model.fit(disp=-1)
print(results_AR.summary())
The best AIC scores are given when using ARIMA orders of (5,1,4) and (5,1,3). The scores are 81.366 and 84.318 respectively. Both of these scores are significantly better than the initial model with order (1,1,5).
Looking at the p-values of the coefficients, the model with order (5,1,4) has p-values of 0 except for two p-values greater than 0.8 on an autoregressive and a moving-average coefficient. The (5,1,3) model also has p-values of 0 except for two between 0.1 and 0.45. Both models have a couple of coefficients that do not contribute significantly.
# Plot residual errors
residuals = pd.DataFrame(results_AR.resid)
fig, ax = plt.subplots(1,2,figsize=(15,5))
residuals.plot(title="Residuals", ax=ax[0])
residuals.plot(kind='kde', title='Density', ax=ax[1])
# Remove y-label and legends
for a in ax.flat:
    a.set(ylabel='')
    a.get_legend().remove()
plt.show()
The residuals look reasonable, with larger spikes at the turn of each year because annual average scores were interpolated linearly by month. The residual errors are roughly Gaussian with a slight skew.
# Residual stats
residuals.describe()
# Actual vs Fitted
with plt.rc_context():
plt.rc("figure", figsize=(12,5))
results_AR.plot_predict(dynamic=False)
plt.title('Actual vs. Fitted')
plt.show()
# Split the timeseries into contiguous training and test datasets with an 80:20 ratio
split = int(len(interp_df)*.80)
train = interp_df[0:split]
len(train)
test = interp_df[split:]
len(test)
#from numpy.linalg import LinAlgError
# Build Model
try:
    model = ARIMA(train, order=(best_p, best_d, best_q))
    #model = ARIMA(train, order=(5,1,3))
    fitted = model.fit(disp=-1)
except (ValueError, LinAlgError) as err:
    print('Model fit failed:', err)  # 'fitted' will be undefined if this happens
# Forecast
fc, se, conf = fitted.forecast(len(test), alpha=0.05) # 95% conf
# Make as pandas series
fc_series = pd.Series(fc, index=test.index)
lower_series = pd.Series(conf[:, 0], index=test.index)
upper_series = pd.Series(conf[:, 1], index=test.index)
# Plot
plt.figure(figsize=(12,5))
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast', color='green')
plt.fill_between(lower_series.index, lower_series, upper_series,
color='k', alpha=.15)
plt.ylim(0,100)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=11)
plt.show()
def squared_error(true, predicted):
    """ Finds the squared error between two points: true value and predicted value """
    return (true - predicted)**2
# End of forecast root squared error
np.sqrt(squared_error(test.score[-1], fc_series[-1]))
# Find the Mean Squared Error for all data in test and forecasted series
#from sklearn.metrics import mean_squared_error
error = mean_squared_error(test, fc_series)
print('Total Test MSE: %.3f' % error)
MSE between the test set and forecasted set is: 7.843 with ARIMA order (5,1,4), and 8.533 with order (5,1,3).
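Because MSE is in squared score units, taking the square root puts the error back on the 0-100 interest-score scale, which is easier to interpret. A quick sketch using the two MSE values above:

```python
import numpy as np

# RMSE = sqrt(MSE): converts squared-score units back to score units
for label, mse in [('(5,1,4)', 7.843), ('(5,1,3)', 8.533)]:
    print(label, 'RMSE = %.2f' % np.sqrt(mse))  # -> 2.80 and 2.92
```

So both models miss the held-out monthly scores by roughly 3 interest points on average.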
# Evaluate model performance for one year forecasts.
# Create 4x3 subplots
# Calculate the root mean squared error (RMSE) of each 1 year forecast for the years 2007 through 2017.
# Then print output the mean RMSE.
# Create range of indices to split data on - splits yearly on January
splits = range(24,156,12)
all_rmse = []
fig = plt.figure(figsize=(15,10))
for i in range(len(splits)):
    train = interp_df[0:splits[i]]
    test = interp_df[splits[i]:]
    # Build Model
    try:
        model = ARIMA(train, order=(best_p, best_d, best_q))
        fitted = model.fit(disp=-1)
    except (ValueError, LinAlgError):
        continue  # skip splits where the model cannot be fit
    # Forecast
    fc, se, conf = fitted.forecast(len(test), alpha=0.05)  # 95% conf
    # Make as pandas series
    fc_series = pd.Series(fc, index=test.index)
    lower_series = pd.Series(conf[:, 0], index=test.index)
    upper_series = pd.Series(conf[:, 1], index=test.index)
    # Calculate Root Mean Squared Error of the 1 year forecast
    sq_err = squared_error(test.score[12], fc_series[12])
    rmse = np.sqrt(sq_err)
    all_rmse.append(rmse)
    ax = fig.add_subplot(4, 3, i+1)
    plt.plot(train, label='training')
    plt.plot(test, label='actual')
    plt.plot(fc_series, label='forecast', color='green')
    plt.fill_between(lower_series.index, lower_series, upper_series,
                     color='k', alpha=.15)
    plt.ylim(0, 100)
    # Vertical line to highlight 1 year prediction vs. actual score
    plt.axvline(test.index[12], color='red', linestyle='--')
    plt.legend(loc='upper left', fontsize=11)
fig.suptitle('Actual vs. Forecasted Score', y=1.05, fontsize='large')
# Set common labels
fig.text(0.5, 0.0, 'Year', ha='center', va='center', fontsize='medium')
fig.text(0.0, 0.5, 'Average Interest Score', ha='center', va='center', rotation='vertical', fontsize='medium')
plt.tight_layout()
np.mean(all_rmse)
# Create a dataframe of years and corresponding rmse
predicted_years = np.arange(2007, 2018)
rmse_df = pd.DataFrame({'year': predicted_years, 'rmse':all_rmse})
rmse_df.set_index('year', inplace=True)
#print(rmse_df)
rmse_df.T
(5,1,4) model: the mean RMSE across all 1-year forecasts is 10.689
(5,1,3) model: the mean RMSE across all 1-year forecasts is 10.247
When comparing the 1-year forecasts, the ARIMA model with order (5,1,3) performs slightly better.
model = ARIMA(interp_df, order=(best_p, best_d, best_q)) # (p,d,q) -> p = lag value for autoregression, d = difference order, q = size of moving average window
results_AR = model.fit(disp=-1)
def plot_forecast(n_months=12):
    """ Plot ARIMA forecast, n_months after January 2017 """
    # Create new indices for forecast
    end = interp_df.index[-1]
    new_inds = end + np.arange(n_months) + 1
    # Forecast using fitted model
    fc, se, conf = results_AR.forecast(n_months, alpha=0.05)  # 95% conf
    # Make as pandas series
    fc_series = pd.Series(fc, index=new_inds)
    lower_series = pd.Series(conf[:, 0], index=new_inds)
    upper_series = pd.Series(conf[:, 1], index=new_inds)
    # Plot
    plt.figure(figsize=(12,5))
    plt.plot(interp_df, label='training')
    plt.plot(fc_series, label='forecast', color='green')
    plt.fill_between(lower_series.index, lower_series, upper_series,
                     color='k', alpha=.15)
    plt.ylim(0,100)
    plt.title('Forecast')
    plt.xlabel('Year')
    plt.ylabel('Mean Interest Score')
    plt.legend(loc='upper left', fontsize=11)
    plt.show()
    print('The predicted score at end of forecast is =', fc_series[-1])
    return fc_series
fc = plot_forecast(12)
fc[-1:]
interp_df.tail()
fc = plot_forecast(n_months=24)
fc = plot_forecast(n_months=36)
grouped_df.head()
len(set(grouped_df.state))
def get_state_subset(state_abbrev):
    """ Gets the subset for a state; input is the two-letter state abbreviation as a string """
    state_df = grouped_df[grouped_df.state == state_abbrev]
    state_df = state_df.drop('state', axis=1)
    return state_df

def interpolate_df(df):
    """ Converts the year column into a datetime object, then upsamples by month and interpolates linearly """
    # Use datetime formatting
    df['year'] = pd.to_datetime(df['year'], format="%Y")
    # Rename column 'year' to 'date' and set it as the index
    df.columns = ['date', 'score']
    df.set_index('date', inplace=True)
    interp_df = df.resample('M').mean().interpolate(method='linear')
    return interp_df
def get_best_model(interpolated_df):
    """ Grid-searches ARIMA orders on the interpolated dataframe; returns the best p,d,q and the fitted ARIMA model. """
    ps = [0,1,2,3,4,5]
    ds = [0,1,2]
    qs = [0,1,2,3,4,5]
    best_aic = np.inf  # start at infinity so the first successful fit is always kept
    best_p = 0
    best_d = 0
    best_q = 0
    for p in ps:
        for d in ds:
            for q in qs:
                try:
                    # (p,d,q) -> p = lag value for autoregression, d = difference order, q = size of moving average window
                    model = ARIMA(interpolated_df, order=(p,d,q))
                    fitted_AR = model.fit(disp=-1)
                    aic = fitted_AR.aic
                    if aic < best_aic:
                        best_p = p
                        best_d = d
                        best_q = q
                        best_aic = aic
                except (ValueError, LinAlgError):
                    pass  # skip orders that fail to converge
    model = ARIMA(interpolated_df, order=(best_p, best_d, best_q))
    best_fitted = model.fit(disp=-1)
    return best_p, best_d, best_q, best_fitted
def one_yr_forecast(interp_df, fitted_model, n_months=12):
    """ Forecast; returns the forecasted value n_months after January 2017.
    The default returns the 12 month / 1 year forecast value. """
    # Create new indices for forecast
    end = interp_df.index[-1]
    new_inds = end + np.arange(n_months) + 1
    # Forecast using fitted model
    fc, se, conf = fitted_model.forecast(n_months, alpha=0.05)  # 95% conf
    # Make as pandas series and return the final forecasted value
    fc_series = pd.Series(fc, index=new_inds)
    return fc_series[-1]
def plot_forecast_2(interp_df, fitted_model, n_months=12):
    """ Plot ARIMA forecast, n_months after January 2017 """
    # Create new indices for forecast
    end = interp_df.index[-1]
    new_inds = end + np.arange(n_months) + 1
    # Forecast using fitted model
    fc, se, conf = fitted_model.forecast(n_months, alpha=0.05)  # 95% conf
    # Make as pandas series
    fc_series = pd.Series(fc, index=new_inds)
    lower_series = pd.Series(conf[:, 0], index=new_inds)
    upper_series = pd.Series(conf[:, 1], index=new_inds)
    # Plot
    plt.figure(figsize=(12,5))
    plt.plot(interp_df, label='training')
    plt.plot(fc_series, label='forecast', color='green')
    plt.fill_between(lower_series.index, lower_series, upper_series,
                     color='k', alpha=.15)
    plt.ylim(0,100)
    plt.title('Forecast')
    plt.xlabel('Year')
    plt.ylabel('Mean Interest Score')
    plt.legend(loc='upper left', fontsize=11)
    plt.show()
    print('The predicted score at end of forecast is =', fc_series[-1])
    return fc_series
state_df = get_state_subset('NC')
state_df.head()
interp_df = interpolate_df(state_df)
interp_df.head()
p,d,q,fitted_AR = get_best_model(interp_df)
print(p,d,q)
fc = plot_forecast_2(interp_df, fitted_AR, n_months=12)
one_yr_forecast(interp_df, fitted_AR, n_months=12)
nc = get_state_subset('NC')
nc_interp = interpolate_df(nc)
p,d,q,fitted = get_best_model(nc_interp)
nc_fc = one_yr_forecast(nc_interp, fitted)
#import time
predicted_df = pd.DataFrame(columns=['year', 'state', 'score'])  # Initialize dataframe to collect predictions
predicted_yrs = [2018]  # Year(s) to predict
# A loop to find the predicted value for all states in predicted_yrs (2018)
start = time.time()  # Time the loop
for state in set(grouped_df.state):
    for yr in predicted_yrs:
        state_df = get_state_subset(state)    # Get the subset for each state
        interp_df = interpolate_df(state_df)  # Interpolate the subset
        p, d, q, fitted_AR = get_best_model(interp_df)  # Get the best fit model and parameters (p,d,q)
        fc = one_yr_forecast(interp_df, fitted_AR, n_months=12)  # Get the 1 year forecasted value
        predicted_df = predicted_df.append({'year': yr, 'state': state, 'score': fc}, ignore_index=True)
print("Total time taken this loop: ", time.time() - start)  # Print the time it takes to complete the loop
predicted_df.head()
# Create a choropleth map
#import plotly.express as px # Be sure to import express
#import plotly.graph_objects as go
#configure_plotly_browser_state()
fig = px.choropleth(predicted_df, # Input Pandas DataFrame
locations="state", # DataFrame column with locations
color="score", # DataFrame column with color values
hover_name="state", # DataFrame column hover info
locationmode = 'USA-states', # Set location to US states
range_color=[20,100] # Set range for colorbar
)
fig.update_geos(showlakes=False) # Hide the lakes
fig.update_layout(
title_text = 'Predicted Interest Scores for Vaccines (2018)', # Create a Title
geo_scope='usa' # Plot only the USA instead of globe
)
fig.show() # Output the plot to the screen